Automatic Extraction from Scientific Abstracts of Synonyms for Proteins and Genes

Author

  • Hong Yu
Abstract

Introduction: Protein and gene names change frequently as research reveals details about these entities [1]. Because authors often use synonyms, information retrieval requires identification of these alternate names. Many biological databases, such as GenBank [2] and SWISSPROT [3], have synonym databases; however, the databases may not be complete. Furthermore, to our knowledge, the extraction of synonyms is mainly done by laborious manual curation and review. Automating the process is desirable because of the enormous volume of publications. We observed that many scientific abstracts have summaries of synonyms. The synonyms are often specifically proposed or mentioned and may be classified into a set of patterns that can be recognized automatically.

Methods: We manually classified and evaluated patterns used by authors to represent synonyms of proteins or genes in scientific abstracts. We implemented a program, SRE (for synonym recognition and extraction), to recognize and extract the terms associated with the patterns. SRE is written in Perl. The output of SRE is sets of two or more synonyms. We applied SRE to 2,312 scientific abstracts, a subset of abstracts we downloaded from PubMed by the keyword "human." We then evaluated the precision of SRE's results, using our own judgment as the standard. Precision is the number of correct sets of synonyms of proteins or genes divided by the total number of sets of terms retrieved.

Results: We classified several patterns that express synonyms of proteins and genes in scientific abstracts. The simplest patterns are "synonym" or "a synonym of," such as in "Thermoactinomyces candidus should be considered a synonym of Thermoactinomyces vulgaris…" [5], where the synonyms Thermoactinomyces candidus and Thermoactinomyces vulgaris can be extracted as the noun phrases before and after the string "a synonym of."
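As a rough illustration of the "a synonym of" pattern, the following Python sketch pulls out the terms on either side of the cue string. It is not the SRE program, which the abstract says is written in Perl and extracts noun phrases. Here a noun phrase is only approximated by a capitalized word optionally followed by lowercase words, and the list of linking phrases is an assumption made for this example.

```python
import re

# Assumption: linking phrases that may sit between the first term and the cue.
# This list is hypothetical and far from complete.
LINKING = r"(?:should be considered|is considered|is|was)"

# Assumption: a term is one capitalized word plus optional lowercase words.
# This is a crude stand-in for real noun-phrase detection.
PATTERN = re.compile(
    r"(?P<left>[A-Z][\w-]*(?: [a-z][\w-]*)*?)\s+"
    + LINKING
    + r"\s+a synonym of\s+"
    + r"(?P<right>[A-Z][\w-]*(?: [a-z][\w-]*)*)"
)


def extract_synonym_pairs(text):
    """Return (left, right) term pairs found around the cue 'a synonym of'."""
    return [(m.group("left"), m.group("right")) for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    sentence = (
        "Thermoactinomyces candidus should be considered a synonym of "
        "Thermoactinomyces vulgaris"
    )
    print(extract_synonym_pairs(sentence))
```

A production system would replace the regular-expression approximation of a term with a proper noun-phrase chunker.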
To evaluate whether the patterns "synonym" and "a synonym of" would help us to find synonyms of protein or gene names, we retrieved all the PubMed abstracts that contained the keyword "synonym" and manually analyzed whether the associated terms are proteins or genes. A search on the keyword "synonym" for abstracts from 1966 to present retrieved a total of 540 abstracts. A subset of 30 randomly selected abstracts contained no protein or gene names; in most cases, terms were names of species. We therefore discarded this approach. "Called" and "known as" are frequently used to introduce synonyms ("...Apo3 (also …
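The "called" and "known as" cues, together with the precision measure defined in the Methods, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: terms are assumed to be single tokens of letters, digits, and hyphens, the cue is assumed to follow a comma or an opening parenthesis, and the gene names in the usage example are invented placeholders.

```python
import re

# Assumption: a term is a single token of letters, digits, and hyphens.
TERM = r"[A-Za-z0-9][A-Za-z0-9-]*"

# Assumption: the cue follows a comma or "(" and may be preceded by "also".
CUE_PATTERN = re.compile(
    rf"\b(?P<first>{TERM})\s*[,(]\s*(?:also\s+)?(?:called|known\s+as)\s+(?P<second>{TERM})"
)


def extract_synonym_sets(text):
    """Return one frozenset of two terms per cue-phrase match."""
    return [
        frozenset((m.group("first"), m.group("second")))
        for m in CUE_PATTERN.finditer(text)
    ]


def precision(retrieved, judged_correct):
    """Correct retrieved sets divided by the total number of retrieved sets."""
    if not retrieved:
        return 0.0
    correct = sum(1 for synonym_set in retrieved if synonym_set in judged_correct)
    return correct / len(retrieved)


if __name__ == "__main__":
    # Invented placeholder names, used only to exercise the pattern.
    text = "GeneX (also known as GX1) binds PlaceholderA, called PB2 in older work."
    retrieved = extract_synonym_sets(text)
    # Hypothetical manual judgment: only the first set is a true synonym pair.
    judged_correct = {frozenset(("GeneX", "GX1"))}
    print(retrieved, precision(retrieved, judged_correct))
```

Representing each result as a set mirrors the statement that the output is sets of synonyms, and it makes comparison against manually judged sets independent of term order.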


Similar resources

Automatic extraction of gene and protein synonyms from MEDLINE and journal articles

Genes and proteins are often associated with multiple names, and more names are added as new functional or structural information is discovered. Because authors often alternate between these synonyms, information retrieval and extraction benefits from identifying these synonymous names. We have developed a method to extract automatically synonymous gene and protein names from MEDLINE and journa...


Identifying Sections in Scientific Abstracts using Conditional Random Fields

OBJECTIVE: The prior knowledge about the rhetorical structure of scientific abstracts is useful for various text-mining tasks such as information extraction, information retrieval, and automatic summarization. This paper presents a novel approach to categorize sentences in scientific abstracts into four sections, objective, methods, results, and conclusions. METHOD: Formalizing the categorizati...


Analysis of the Distribution and Concentration of Keywords in Theses and Dissertations: Degree of Agreement with Descriptors, Title, and Abstract

Index terms provided by authors and professional indexers are used in traditional information retrieval schemes. However, abstracts ideally contain the core message of a document. This can potentially give us the opportunities to use the abstracts to automatically extract index terms. The purpose of this work is to be used as a base or as the first stage in the automatic keyword extraction as w...


Playing Biology's Name Game: Identifying Protein Names in Scientific Text

A growing body of work is devoted to the extraction of protein or gene interaction information from the scientific literature. Yet, the basis for most extraction algorithms, i.e. the specific and sensitive recognition of protein and gene names and their numerous synonyms, has not been adequately addressed. Here we describe the construction of a comprehensive general purpose name dictionary and ...


Finding High-Frequent Synonyms of A Domain-Specific Verb in English Sub-Language of MEDLINE Abstracts Using WordNet

The task of binary relation extraction in IE [3] is based mainly on high-frequent verbs and patterns. During the extraction of a specific relation from MEDLINE English abstracts, it is noticed that besides the high-frequent verb itself which represents the specific relation, some other word forms, such as the nominal and adjective forms of this verb, as well as its synonyms, also play a very im...




Publication date: 2001